{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Short Introduction to NeuRec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "[NeuRec](https://github.com/wubinzzu/NeuRec) is a library of recommender systems.\n",
    "It was developed with a focus on enabling fast running recommendation model.\n",
    "Based on NeuRec, researchers are able to go from idea to result without concern for trivia, such as data preprocessing and loading, parameter passing, data sampling and iterating, model evaluating, result recording, etc.\n",
    "\n",
    "The main components and its brief introduction:\n",
    "\n",
    "- `Configurator`: This class can reads arguments from an *ini-style* configuration file and parse arguments from the command line simultaneously.\n",
    "The values of arguments are converted from `str` to `int`, `float`, `bool`, `list` and `None` automatically.\n",
    "\n",
    "- `Dataset`: This class remaps the IDs of users and items to consecutive numbers, filters users and items with few interactions, and splits data with leave-one-out or fold-out (ratio) by time or randomly.\n",
    "\n",
    "- `DataIterator`: This class combines some data sets and provides a batch iterator over them.\n",
    "\n",
    "- `ParallelSampler`: This is an abstract class and aims to parallelize the processes of training and negative item sampling.\n",
    "The parallelization and sample iteration are already implemented in\n",
    "this abstract class. Every subclass only needs to provide the no argument `sampling` method.\n",
    "Some commonly used samplers have been implemented, such as `PointwiseSampler`, `PairwiseSampler`, `TimeOrderPointwiseSampler` and `TimeOrderPairwiseSampler`.\n",
    "\n",
    "- `Logger`: This is a simple encapsulation of python logging.\n",
    "This class can show a message on standard output and write it into a file simultaneously.\n",
    "This is convenient for observing and saving training results.\n",
    "\n",
    "- `ProxyEvaluator`: This is an interface to evaluate the ranking performance of recommendation models.\n",
    "The ranking metrics are configurable.\n",
    "This class and its evaluation metrics can automatically fit both leave-one-out and fold-out splitting data without specific indication.\n",
    "And the model performance can be viewed in user groups, which is activated only by an argument.\n",
    "There are two implementation versions of this class: *python* and *cpp*.\n",
    "**Based on the *cpp* implementation, we can fast evaluate recommendation models without sampling candidate negative items.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quick Start\n",
    "\n",
    "Firstly, download this repository and unpack the downloaded source to a suitable location.\n",
    "\n",
    "Secondly, go to '*./NeuRec*' and compline the evaluator of cpp implementation with the following command line:\n",
    "\n",
    "```bash\n",
    "python setup.py build_ext --inplace\n",
    "```\n",
    "\n",
    "If the compilation is successful, the evaluator of cpp implementation will be called automatically.\n",
    "Otherwise, the evaluator of python implementation will be called.\n",
    "\n",
    "**Note that the cpp implementation is much faster than python.**\n",
    "\n",
    "Thirdly, specify dataset and recommender in configuration file *NeuRec.properties*.\n",
    "\n",
    "Finally, run [main.py](./main.py) in IDE or with command line:\n",
    "\n",
    "```bash\n",
    "python main.py\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An Example of Matrix Factorization Model\n",
    "\n",
    "This example aims to describe the building blocks of NeuRec.\n",
    "\n",
    "Following this example, researchers can fast implement their idea and conduct experiments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configurator\n",
    "\n",
    "First, we use `Configurator` to read the parameters from *NeuRec.properties*.\n",
    "If there is more than one section in the configuration file, `Configurator` will read parameters from `default_section` section.\n",
    "\n",
    "Note that, this class can also parse parameters from the command line simultaneously:\n",
    "\n",
    "```bash\n",
    "python main.py --recommender=MF --learning_rate=0.001 --batch_size=128\n",
    "```\n",
    "\n",
    "And the parameters from the command line will cover the parameters of configuration files.\n",
    "\n",
    "Besides the *NeuRec.properties* and command line, `Configurator` also read model parameters from file '*`config_dir`/`recommender`.properties*', where `config_dir` and `recommender` are specified in *NeuRec.properties*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from util import Configurator\n",
    "conf = Configurator(\"NeuRec.properties\", default_section=\"hyperparameters\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The type of parameters are converted automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(type(conf[\"data.input.dataset\"]), conf[\"data.input.dataset\"])\n",
    "print(type(conf[\"recommender\"]), conf[\"recommender\"])\n",
    "print(type(conf[\"learning_rate\"]), conf[\"learning_rate\"])\n",
    "print(type(conf[\"group_view\"]), conf[\"group_view\"])\n",
    "print(type(conf[\"topk\"]), conf[\"topk\"])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "\n",
    "Dataset will be loaded and preprocessed according to the parameters in *NeuRec.properties*:\n",
    "\n",
    "- `data.column.format` specifies the format of data: *UIRT*, *UIT*, *UIR*, *UI*, where *U*, *I*, *R* and *T* indicate user, item, rating and timestamp, respectively.\n",
    "\n",
    "- `data.input.path` is the directory of dataset.\n",
    "\n",
    "- `data.input.dataset` is the name of data files without extension name, which must be '*.rating*', '*.train*' or '*.test*'. More details are in [here](#data-splitting).\n",
    "\n",
    "- `data.convert.separator` is the separator or delimiter of file columns.\n",
    "\n",
    "- `user_min` and `item_min` control the preprocessing of dataset: the users (items) whose interactions less than `user_min` (`user_min`) will be filtered.\n",
    "And then, whatever the preprocessing did or not, users and items will be remapped to consecutive numbers.\n",
    "\n",
    "- `splitter` and `ratio` <a id=\"data-splitting\"></a> specify the division method of the dataset.\n",
    "`splitter` must be *ratio*, *loo* or *given*.\n",
    "  - If `splitter` is *ratio* or *loo*, the extension of `data.input.dataset` must be '*.rating*'.\n",
    "  - And if `splitter` is *ratio*, `ratio` and `1-ratio` items of each user will be divided into the training and test sets, respectively. This manner is also named fold-out.\n",
    "  - If `splitter` is *loo*, one item of each user will be divided into the training and test sets, respectively. This manner is also named leave-one-out.\n",
    "  - If `splitter` is *given*, it indicates that the dataset has been split, and the extensions of `data.input.dataset` must be '*.train*' and '*.test*'.\n",
    "\n",
    "- `by_time` means that whether the items of each user are split by interacted time or not, if there is timestamp information in the dataset.\n",
    "\n",
    "- `rec.evaluate.neg` is the number of candidate negative items for model evaluation.\n",
    "\n",
    "  - If `rec.evaluate.neg=0`, model performance will be evaluated without negative items (i.e., with all items).\n",
    "  - If `rec.evaluate.neg` is larger than `0`, negative items will be sampled for each user.\n",
    "  - And if the dataset has been split (i.e., `splitter` is *given*) and candidate negative items have been provided, the file extension of candidate negative items must be '*.neg*' and its contents must be organized as follows, where `n` is the number of candidate item for each user and must be equal to `rec.evaluate.neg`.\n",
    "  These contents are also separated by `data.convert.separator`.\n",
    "    \n",
    "    | | | | | |\n",
    "    -|-|-|-|-\n",
    "    user1 | item11 | item12 | ... | item1n\n",
    "    user2 | item21 | item22 | ... | item2n\n",
    "    user3 | item23 | item23 | ... | item3n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from data.dataset import Dataset\n",
    "dataset = Dataset(conf)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logger\n",
    "\n",
    "Create a logger"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import os\n",
    "from util import Logger\n",
    "\n",
    "timestamp = time.time()\n",
    "data_name = dataset.dataset_name\n",
    "model_name = conf[\"recommender\"]\n",
    "\n",
    "param_str = \"%s_%s\" % (data_name, conf.params_str())\n",
    "run_id = \"%s_%.8f\" % (param_str[:150], timestamp)\n",
    "\n",
    "log_dir = os.path.join(\"log\", data_name, model_name)\n",
    "logger_name = os.path.join(log_dir, run_id + \".log\")\n",
    "logger = Logger(logger_name)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the statistic of dataset, `logger.info()` print a message on standard output and write it into a file simultaneously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"log file is:\\t\", logger_name)\n",
    "logger.info(dataset)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ProxyEvaluator\n",
    "\n",
    "`ProxyEvaluator` is the interface to evaluate models and contains various evaluation protocols.\n",
    "\n",
    "- Evaluation metrics are configurable via the argument `metric`.\n",
    "If `metric` is `None`, metric will be set to `[\"Precision\", \"Recall\", \"MAP\", \"NDCG\", \"MRR\"]`.\n",
    "Otherwise, `metric` must be one or a sublist of metrics mentioned above.\n",
    "\n",
    "- Evaluation metrics can automatically fit both leave-one-out and fold-out data splitting without specific indication.\n",
    "In leave-one-out evaluation:\n",
    "  - `Recall` is equal to `HitRatio`;\n",
    "  - The implementation of `NDCG` is compatible with fold-out;\n",
    "  - `MAP` and `MRR` have the same numeric values;\n",
    "  - `Precision` is meaningless.\n",
    "\n",
    "- `top_k` controls the Top-K item ranking performance. Defaults to `50`.\n",
    "  - If `top_k` is an integer, K ranges from `1` to `top_k`;\n",
    "  - If `top_k` is a list of integers, K are only assigned these values.\n",
    "\n",
    "- The ranking performance of models can be viewed in user groups, which are split according to the numbers of users' interactions in training data.\n",
    "This function can be activated by the argument `group_view`.\n",
    "  - If `group_view` is `None`, the ranking performance will be viewed without groups.\n",
    "  - If `group_view` is a list of integers, the ranking performance will be view in groups.\n",
    "  - For example, if `group_view = [10,30,50,100]`, users will be split into four groups: `(0, 10]`, `(10, 30]`, `(30, 50]` and `(50, 100]`. And the users whose interacted items more than `100` will be discarded.\n",
    "\n",
    "- **Importantly**, there are two implementation versions of this class: *python* and *cpp*.\n",
    "And both of the two versions are multi-threaded.\n",
    "Based on the *cpp* implementation, we can fast evaluate recommendation models without sampling candidate negative items.\n",
    "\n",
    "Using the following command to compile cpp code:\n",
    "\n",
    "```bash\n",
    "python setup.py build_ext --inplace\n",
    "```\n",
    "\n",
    "After the successful compilation, the cpp version will be called automatically.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evaluator import ProxyEvaluator\n",
    "\n",
    "evaluator = ProxyEvaluator(dataset.get_user_train_dict(),\n",
    "                           dataset.get_user_test_dict(),\n",
    "                           dataset.get_user_test_neg_dict(),\n",
    "                           metric=[\"Precision\", \"NDCG\"],\n",
    "                           group_view=conf[\"group_view\"],\n",
    "                           top_k=[10, 20],\n",
    "                           batch_size=conf[\"test_batch_size\"],\n",
    "                           num_thread=conf[\"num_thread\"])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matrix factorization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "\n",
    "class MF(object):\n",
    "    def __init__(self, dataset, conf):\n",
    "        self.learning_rate = conf[\"learning_rate\"]\n",
    "        self.embedding_size = conf[\"embedding_size\"]\n",
    "        self.reg_mf = conf[\"reg_mf\"]\n",
    "        self.stddev = conf[\"stddev\"]\n",
    "        self.num_users = dataset.num_users\n",
    "        self.num_items = dataset.num_items\n",
    "    \n",
    "    def _create_placeholder(self):\n",
    "        self.user_ph = tf.placeholder(tf.int32, shape=[None])\n",
    "        self.pos_item_ph = tf.placeholder(tf.int32, shape=[None])\n",
    "        self.neg_item_ph = tf.placeholder(tf.int32, shape=[None])\n",
    "    \n",
    "    def _create_variables(self):\n",
    "        initializer = tf.random_uniform_initializer(-self.stddev, self.stddev)\n",
    "        self.user_embeddings = tf.Variable(initializer([self.num_users, self.embedding_size]), dtype=tf.float32)\n",
    "        self.item_embeddings = tf.Variable(initializer([self.num_items, self.embedding_size]), dtype=tf.float32)\n",
    "    \n",
    "    def build_model(self):\n",
    "        self._create_placeholder()\n",
    "        self._create_variables()\n",
    "        user_emb = tf.nn.embedding_lookup(self.user_embeddings, self.user_ph)\n",
    "        pos_item_emb = tf.nn.embedding_lookup(self.item_embeddings, self.pos_item_ph)\n",
    "        neg_item_emb = tf.nn.embedding_lookup(self.item_embeddings, self.neg_item_ph)\n",
    "\n",
    "        pos_rating = tf.reduce_sum(tf.multiply(user_emb, pos_item_emb), 1)\n",
    "        neg_rating = tf.reduce_sum(tf.multiply(user_emb, neg_item_emb), 1)\n",
    "        \n",
    "        # ranking loss\n",
    "        ranking_loss = -tf.reduce_sum(tf.log_sigmoid(pos_rating-neg_rating))\n",
    "        \n",
    "        # reg loss\n",
    "        params = [user_emb, pos_item_emb, neg_item_emb]\n",
    "        reg_loss = tf.add_n([tf.nn.l2_loss(w) for w in params])\n",
    "\n",
    "        final_loss = ranking_loss + self.reg_mf*reg_loss\n",
    "        \n",
    "        self.update_opt = tf.train.AdamOptimizer(self.learning_rate).minimize(final_loss)\n",
    "\n",
    "        # for evaluating model\n",
    "        self.predict_ratings = tf.matmul(user_emb, self.item_embeddings, transpose_b=True)\n",
    "\n",
    "        self.sess = tf.Session()\n",
    "        self.sess.run(tf.global_variables_initializer())\n",
    "    \n",
    "    def train_step(self, users, pos_items, neg_items):\n",
    "        feed_dict = {self.user_ph: users, \n",
    "                     self.pos_item_ph: pos_items, \n",
    "                     self.neg_item_ph: neg_items}\n",
    "        self.sess.run(self.update_opt, feed_dict=feed_dict)\n",
    "\n",
    "    def predict(self, users, candidate_items=None):\n",
    "        # evaluator will call this method to get item ranking scores \n",
    "        # and evaluate model performance\n",
    "        ratings = self.sess.run(self.predict_ratings, feed_dict={self.user_ph: users})\n",
    "        return ratings\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ParallelSampler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from data import PairwiseSampler\n",
    "\n",
    "data_sampler = PairwiseSampler(dataset, neg_num=1, batch_size=conf[\"batch_size\"], shuffle=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MF(dataset, conf)\n",
    "model.build_model()\n",
    "\n",
    "num_epochs = conf[\"epochs\"]\n",
    "batch_size = conf[\"batch_size\"]\n",
    "\n",
    "logger.info(conf)\n",
    "\n",
    "logger.info(evaluator.metrics_info())\n",
    "for epoch in range(num_epochs):\n",
    "    for bat_users, bat_items_pos, bat_items_neg in data_sampler:\n",
    "        model.train_step(bat_users, bat_items_pos, bat_items_neg)\n",
    "    logger.info(\"epoch: %d:\\t%s\"%(epoch, evaluator.evaluate(model)))\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3.6.9 64-bit ('tf114': conda)",
   "language": "python",
   "name": "python36964bittf114condaf3caffa36f11483a823506e6e8d43a6b"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
